Group 33 - Xinghao Huang - 81848509¶


Section 1: Data Description¶

1. Descriptive Summary¶

  • There are 5280 observations in total (2653 in athens_weekdays.csv and 2627 in athens_weekends.csv) and 19 variables. Both datasets share the same variables; the row identifier column (shown as X when the files are loaded) is not counted.

  • Variable Summary:

| Variable | Type | Description |
|---|---|---|
| realSum | Quantitative | total price of the listing |
| room_type | Categorical/nominal | room type: private room, shared room, or entire home/apt |
| room_shared | Categorical/binary | whether a room is shared |
| room_private | Categorical/binary | whether a room is private |
| person_capacity | Quantitative | number of people a room can accommodate |
| host_is_superhost | Categorical/binary | whether a host is a superhost |
| multi | Categorical/binary | whether the listing is for multiple rooms |
| biz | Categorical/binary | whether an observation is associated with a business |
| cleanliness_rating | Quantitative | rating of cleanliness |
| guest_satisfaction_overall | Quantitative | overall rating from guests comparing all listings offered by the host |
| bedrooms | Quantitative | number of bedrooms |
| dist | Quantitative | distance from city center |
| metro_dist | Quantitative | distance from the nearest metro station |
| attr_index | Quantitative | attr index |
| attr_index_norm | Quantitative | normalized attr index |
| rest_index | Quantitative | rest index |
| rest_index_norm | Quantitative | normalized rest index |
| lng | Quantitative | longitude coordinate for location identification |
| lat | Quantitative | latitude coordinate for location identification |

2. Source and Information¶

  • The datasets were originally obtained from Gyódi and Nawaro (2021), Determinants of Airbnb Prices in European Cities: A Spatial Econometrics Approach (supplementary material), published on Zenodo.

  • The data were collected from Airbnb listings across multiple European cities, focusing on listing attributes, host information, and spatial factors affecting pricing.

  • This dataset offers a detailed overview of Airbnb prices in Athens, including information on room type, cleanliness and satisfaction ratings, number of bedrooms, distance from the city centre, and other attributes that help explain price differences between weekday and weekend stays.

  • Citation: Gyódi, K., & Nawaro, Ł. (2021, March 25). Determinants of Airbnb prices in European cities: A Spatial Econometrics Approach (supplementary material). Zenodo. https://zenodo.org/records/4446043#.Y9Y9ENJBwUE

3. Preselection of Variables¶

  • room_shared, room_private, and multi will be dropped because they are redundant: the same, and more complete, information is available from room_type and bedrooms.
  • lng and lat will be dropped because they only provide raw spatial coordinates, and information regarding distance can be acquired from dist and metro_dist.
  • attr_index, attr_index_norm, rest_index, and rest_index_norm will also be dropped because their definitions and interpretations are unclear from the dataset documentation, and they appear to be derived quantities rather than raw listing attributes.
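
The preselection above can be sketched in code. This is a minimal illustration, assuming dplyr is available; the data frame `toy` and its values are made up and only mimic the column names of the real data.

```r
library(dplyr)

# Toy stand-in for the listings data (values are invented for illustration)
toy <- data.frame(
  realSum = c(120, 80),
  room_type = c("Entire home/apt", "Private room"),
  room_shared = c("False", "False"),
  room_private = c("False", "True"),
  multi = c(0L, 1L),
  dist = c(1.2, 2.5),
  metro_dist = c(0.3, 0.4),
  attr_index = c(55, 78),
  attr_index_norm = c(2.1, 3.0),
  rest_index = c(79, 113),
  rest_index_norm = c(5.9, 8.5),
  lng = c(23.77, 23.73),
  lat = c(37.98, 38.00)
)

# Variables flagged for removal in the preselection step
dropped <- c("room_shared", "room_private", "multi", "lng", "lat",
             "attr_index", "attr_index_norm", "rest_index", "rest_index_norm")

# all_of() raises an error if any listed column is missing
toy_selected <- toy %>% select(-all_of(dropped))
names(toy_selected)
```

Using `all_of()` rather than bare names is a design choice: a misspelled column name fails loudly instead of being silently ignored.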

Section 2: Scientific Question¶

1. State the Question¶

  • Question: How is the Airbnb price in Athens associated with day type, room type, customer satisfaction, cleanliness rating, and location?
  • Specifically, I want to understand which of these factors has the strongest relationship with the Airbnb price.

2. Name the Response¶

  • The response variable is realSum (the Airbnb price in Athens).

3. Question Focus¶

  • My question mainly focuses on inference, since it is about understanding how day type, room type, customer satisfaction, cleanliness rating, and location are associated with the Airbnb price rather than predicting new outcomes.

Section 3: Exploratory Data Analysis and Visualization¶

1. Reproducible Code¶

In [218]:
# load some libraries
library(ggplot2)
library(dplyr)
library(patchwork)


The two datasets have been uploaded from my local device to the STAT 301 Workspace. The cell below shows how they can be loaded into R.

In [219]:
# reading the file
athens_weekdays <- read.csv("/home/jovyan/work/stat-301/materials/Project/data/athens_weekdays.csv", header = TRUE)
athens_weekends <- read.csv("/home/jovyan/work/stat-301/materials/Project/data/athens_weekends.csv", header = TRUE)

# check whether athens_weekends has any missing values (athens_weekdays is not checked here)
sum(is.na(athens_weekends)) == 0
TRUE
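
The check above only covers athens_weekends. A minimal sketch of how both data frames could be checked in one step is given below; `weekdays_toy` and `weekends_toy` are hypothetical stand-ins with made-up values, not the real data.

```r
# Hypothetical stand-ins for the two data frames (made-up values)
weekdays_toy <- data.frame(realSum = c(120, NA), dist = c(1.2, 2.5))
weekends_toy <- data.frame(realSum = c(130, 95), dist = c(0.8, 3.1))

# Count missing values in each data frame; a count of 0 means no NAs
na_counts <- sapply(
  list(weekdays = weekdays_toy, weekends = weekends_toy),
  function(df) sum(is.na(df))
)
na_counts
```

Replacing the toy data frames with athens_weekdays and athens_weekends would report the number of missing values in each file.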


Next, I add a column, day_type, indicating whether each observation comes from the weekday or the weekend dataset. It has 2 levels: Weekdays and Weekends. The two datasets are then merged into one dataset called athens.

In [220]:
# add indicator columns to both
athens_weekdays <- athens_weekdays %>% mutate(day_type = as.factor("Weekdays"))
athens_weekends <- athens_weekends %>% mutate(day_type = as.factor("Weekends"))

# merge the two datasets into one
athens <- rbind(athens_weekdays, athens_weekends)
head(athens)
A data.frame: 6 × 21
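| | X | realSum | room_type | room_shared | room_private | person_capacity | host_is_superhost | multi | biz | cleanliness_rating | ⋯ | bedrooms | dist | metro_dist | attr_index | attr_index_norm | rest_index | rest_index_norm | lng | lat | day_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | <int> | <dbl> | <chr> | <chr> | <chr> | <dbl> | <chr> | <int> | <int> | <dbl> | ⋯ | <int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <fct> |
| 1 | 0 | 129.82448 | Entire home/apt | False | False | 4 | False | 0 | 0 | 10 | ⋯ | 2 | 2.8139635 | 0.8818900 | 55.34857 | 2.086871 | 78.77838 | 5.915160 | 23.76600 | 37.98300 | Weekdays |
| 2 | 1 | 138.96375 | Entire home/apt | False | False | 4 | True | 1 | 0 | 10 | ⋯ | 1 | 0.4072929 | 0.3045679 | 240.30665 | 9.060559 | 407.16770 | 30.572629 | 23.73168 | 37.97776 | Weekdays |
| 3 | 2 | 156.30492 | Entire home/apt | False | False | 3 | True | 0 | 1 | 10 | ⋯ | 1 | 1.2372111 | 0.2884881 | 199.50737 | 7.522257 | 395.96740 | 29.731642 | 23.72200 | 37.97900 | Weekdays |
| 4 | 3 | 91.62702 | Entire home/apt | False | False | 4 | True | 1 | 0 | 10 | ⋯ | 1 | 4.3674572 | 0.2974673 | 39.80305 | 1.500740 | 58.70658 | 4.408047 | 23.72712 | 38.01435 | Weekdays |
| 5 | 4 | 74.05151 | Private room | False | True | 2 | False | 0 | 0 | 10 | ⋯ | 1 | 2.1941850 | 0.3852657 | 78.73340 | 2.968577 | 113.32597 | 8.509204 | 23.73391 | 37.99529 | Weekdays |
| 6 | 5 | 113.88934 | Entire home/apt | False | False | 6 | True | 1 | 0 | 10 | ⋯ | 2 | 2.0712056 | 0.4538674 | 96.58899 | 3.641806 | 158.64432 | 11.911981 | 23.71584 | 37.98598 | Weekdays |
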
In [221]:
summary(athens$realSum)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   42.88    98.66   127.72   151.74   171.54 18545.45 


Note that realSum contains extreme outliers: the maximum (18545.45) is more than 100 times the third quartile (171.54). Such values make it harder to see the pattern of the majority of observations, so I filter them out for the visualization only.

Within each combination of room_type and day_type, only values inside the boxplot whisker range, [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR], are kept.

In [222]:
# filter the data
realSum_within_range <- athens %>%
    group_by(room_type, day_type) %>%
    filter( (realSum >= quantile(realSum,0.25)-1.5*IQR(realSum)) & (realSum <= quantile(realSum,0.75)+1.5*IQR(realSum)) ) %>%
    ungroup() %>%
    select(realSum, day_type, room_type, dist) # these 4 variables will be used for the visualization(s)

head(realSum_within_range)
A tibble: 6 × 4
| realSum | day_type | room_type | dist |
|---|---|---|---|
| <dbl> | <fct> | <chr> | <dbl> |
| 129.82448 | Weekdays | Entire home/apt | 2.8139635 |
| 138.96375 | Weekdays | Entire home/apt | 0.4072929 |
| 156.30492 | Weekdays | Entire home/apt | 1.2372111 |
| 91.62702 | Weekdays | Entire home/apt | 4.3674572 |
| 74.05151 | Weekdays | Private room | 2.1941850 |
| 113.88934 | Weekdays | Entire home/apt | 2.0712056 |
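
Because the filter is applied per group, it can be useful to know how many observations each group loses. The sketch below illustrates one way to count them, assuming dplyr; the data frame `toy` is made up, and only room_type is used as the grouping variable to keep the example small.

```r
library(dplyr)

# Made-up prices: the first group contains one extreme value (1000)
toy <- data.frame(
  room_type = rep(c("Entire home/apt", "Private room"), each = 6),
  realSum = c(100, 110, 120, 130, 140, 1000, 50, 55, 60, 65, 70, 75)
)

outlier_summary <- toy %>%
  group_by(room_type) %>%
  mutate(
    # whisker bounds computed within each group
    lower = quantile(realSum, 0.25) - 1.5 * IQR(realSum),
    upper = quantile(realSum, 0.75) + 1.5 * IQR(realSum),
    outside = realSum < lower | realSum > upper
  ) %>%
  summarise(n = n(), n_outside = sum(outside), .groups = "drop")

outlier_summary
```

In the real data, grouping by both room_type and day_type would match the filter used for the plots.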

2. Visualization¶

The cell below generates boxplots faceted by room_type. day_type is mapped to both the x-axis and the fill color, realSum is mapped to the y-axis, and individual observations are overlaid as jittered points.

In [223]:
# boxplot
box_price_by_room <- realSum_within_range %>%
    ggplot(aes(x = day_type, y = realSum, fill = day_type)) +
    geom_boxplot(fatten = 4) + # adjust the width of the median bar
    geom_jitter(color="gray", size=0.4, alpha=0.6) + # adding individual observations
    facet_grid(~room_type) + # facet by room_type
    ggtitle("Airbnb Prices Distribution per Room/Day Type") +
    labs(x = "Day Types", y = "Airbnb Price in Athens", fill = "Day Type")


The cell below generates scatterplots faceted by room_type. dist is mapped to the x-axis, realSum to the y-axis, and room_type to the point color.

In [226]:
# scatterplot
scatter_price_vs_dist <- realSum_within_range %>%
    ggplot(aes(x = dist, y = realSum, color = room_type))+
    geom_point() + 
    facet_grid(~room_type) + 
    ggtitle("Airbnb Price vs. Distance from City Center by Room Type") +
    labs(x = "Distance from City Center", y = "Airbnb Price in Athens", color = "Room Type")


The cell below places the two plots side by side in a single figure.

In [227]:
# concatenate two plots into one
options(repr.plot.width = 14, repr.plot.height = 6) # resize the plot
box_price_by_room + scatter_price_vs_dist
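[Figure: left, boxplots of realSum by day_type, faceted by room_type ("Airbnb Prices Distribution per Room/Day Type"); right, scatterplots of realSum against dist, faceted by room_type.]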

3. Interpretations¶

Explain why you consider this plot relevant to address your question or to explore the data.¶

  • The whole plot visualizes how Airbnb prices in Athens vary across day type, room type, and distance from the city center.
  • The boxplot shows the price distribution by room and day type, while the scatterplot explores how location is associated with price for each room category.
  • Together, they directly address my question by incorporating key variables, such as day_type, room_type, realSum, and dist, to examine both categorical and spatial influences on Airbnb pricing.

Interpret briefly the results obtained.¶

  • The Entire home/apt category has the most listings and the highest prices overall, with a wider spread compared to Private room and Shared room, suggesting greater price variation in full apartments.
  • There is no strong distinction between weekday and weekend prices within each room type, indicating that daily demand fluctuations may not heavily affect Airbnb pricing in Athens.
  • Most Airbnb listings are close to the city center, where a wide range of prices exists, implying that location alone may not fully explain price differences among listings.

What do you learn from your visualization?¶

  • Isolating the effect of room type will be essential in later inference stages, as each room type shows a distinct price distribution. Without doing so, I may encounter issues such as Simpson’s Paradox, which could lead to misleading conclusions when combining groups.
  • Because these visualizations exclude extreme outliers in realSum for clarity, future analyses should consider their impact: they may substantially change the results or inflate variability, reducing the reliability of model estimates.
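
To make the Simpson's Paradox concern concrete, the sketch below uses a small, deliberately contrived data set (not the Athens data) in which price falls with distance inside each room type, yet the pooled slope is positive because the more expensive room type happens to sit farther out.

```r
# Contrived data: within each room type, price falls by 5 per unit of distance
toy <- data.frame(
  room_type = rep(c("Private room", "Entire home/apt"), each = 3),
  dist = c(1, 2, 3, 4, 5, 6),
  realSum = c(60, 55, 50, 130, 125, 120)
)

# Pooled slope: ignores room_type
pooled_slope <- coef(lm(realSum ~ dist, data = toy))[["dist"]]

# Within-group slopes: one simple regression per room type
within_slopes <- sapply(
  split(toy, toy$room_type),
  function(d) coef(lm(realSum ~ dist, data = d))[["dist"]]
)

pooled_slope   # positive
within_slopes  # negative in both groups
```

The sign of the association flips once room_type is accounted for, which is why the later inference stage should condition on it.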